METHOD AND SYSTEM FOR ANALYSIS OF CANCER
BIOMARKERS USING PROTEOME IMAGE MINING 



TECHNICAL FIELD

The present invention relates to a method of mining meaningful biomarker spots in
a specific disease and of diagnostic screening for a diseased state by transforming each of the
separated states of serum proteins from a plurality of normal and diseased living individuals on a
2D (two-dimensional) gel into an image, producing a disease-specific serum proteome standard
(proteome pattern) by an image mining technique, and comparing the proteome of a subject
organism with the proteome standards of normal or diseased individuals. The present invention
is also concerned with a system implementing a method of screening for cancer. More
particularly, the present invention relates to a system and a method for early detection of
cancer, which are capable of identifying the proteome pattern of a specific cancer by producing
serum proteome standards by an image mining technique and then comparing the proteome
of a subject with the proteome standards. Further, the present invention relates to a
proteome pattern for a specific cancer type, comprising one or more specific serum proteins,
which can be used as cancer-specific biomarkers in such a system or method for cancer
diagnosis.

BACKGROUND ART 

Recently, with the rapid development of bioinformatics and analysis techniques for
DNA sequences, a large volume of genomic data of humans, animals, plants and
microorganisms has become known, thus giving rise to a broad range of industrial applications,
including diverse research fields, such as development of new pharmaceutical preparations,
new diagnostic tools for diseases and production of genetically modified plants.
Bioinformatics, which is a technique of rapidly and effectively processing a large volume of
data through fusion of biotechnology (BT) and information technology (IT), can collect, store
and analyze a large volume of information carried by living organisms and apply the resulting
data to a wide variety of fields, such as pharmaceuticals, foods, agriculture or environmental
engineering, thereby creating high-value products.

As a result of completion of the human genome project and the development of
bioinformatics, it was found that genes play a critical role in determining cause-effect
relationships of diseases in humans and phenotypes of humans. That is, in spite of having
very similar DNA sequences, humans show differences in their appearance, height, character,
and features as individuals [...] by the environmental factors.

In this regard, the human genome and clinical data obtained using the same can be applied
to treat incurable diseases such as cancer, for which much better therapeutic
effects are expected if the disease is discovered at an early stage. Urine, tears, saliva, etc., have been used
for detection of diseases at an early stage, and recently, serum proteomes are often used.

A multifactorial disease, like cancer, develops by the combinatorial action of genetic
factors and environmental factors. For the diagnosis and prognostic evaluation of cancer,
the overall proteome changes accompanying the development, progression and malignant
degeneration of cancer must be analyzed. Since cancer is influenced not by one or two
kinds of abnormal cells or tissues, but by abnormal function involving several
organs, body fluids such as serum are suitable as biological samples capable of indicating
changes in the proteome. Especially, in the case of diseases for which early diagnosis
and prognosis are difficult, such as lung cancer, blood serum is considered an optimal
sample because it is easily obtainable and widely used in clinical tests.
When comparing components of a serum proteome of a normal human with those of a
cancer patient, the protein composition of the serum proteome is predicted to differ. However, at
present, the specific differences in protein composition are unknown. Although the human
genomic map was completed by the human genome project, there is still no information
precisely identifying the relationship between genes and the proteins expressed from the
information encoded in the genes.

In particular, diseases such as cancer are induced by specific modifications of specific
genes, and such modifications are thought to evoke changes in the protein composition of the
serum proteome. Through analysis of such a change in composition of the serum proteome,
diseases such as cancer can be discovered at an early stage, as disclosed in the prior art. For
example, PCT Application No. PCT/AU01/00877, filed on July 19th, 2001 by Parish,
Christopher et al., discloses molecular species identified by
comparing a profile of molecular species in a serum sample from a human or animal subject
having cancer with that in a serum sample from a healthy human or animal subject using a
mass spectrometry-based method, and their use as cancer markers. In detail, disclosed is a
method of identifying a cancer marker, comprising the steps of (i) separating a blood fraction
from a human or animal subject having cancer by mass spectrometry; (ii) separating a blood
fraction from a healthy human or animal by mass analysis; and (iii) comparing a profile of
molecular species at step (i) with that at step (ii) and identifying increased or reduced molecular
species, wherein an increased or reduced level of the molecular species indicates that the
molecular species is a cancer marker.

PCT Application No. PCT/US01/28133, filed on September 7th, 2001 by Yip, Tai-Tung et al.,
discloses a novel protein marker for diagnosis of breast cancer, which was discovered using
Surface-Enhanced Laser Desorption/Ionization (SELDI) mass spectrometry, in which a breast
cancer patient and a normal human can be distinguished by determining the presence or absence,
the amount and the detected frequency of the protein marker.
Recently, Emanuel F. Petricoin III, et al. (Lancet, 359:572-577, 2002) reported a
proteome pattern of ovarian cancer patients, which is obtained using Surface-Enhanced Laser
Desorption/Ionization-Time of Flight (SELDI-TOF) mass spectrometry and differs from that of
normal humans, and such a proteome pattern can be applied for diagnosis of ovarian cancer
with high sensitivity and specificity.

However, all of the above-mentioned research utilized SELDI-TOF or MALDI-TOF
mass spectrometry, which is a one-dimensional analysis method, to find cancer-specific serum
molecular species including proteins useful as cancer markers. Only the factor 'mass'
was used in comparing serum proteins from a cancer patient with those of a normal human, and
cancer-specific serum proteins were determined only by evaluating increased or reduced levels of
a large number of serum proteins. Therefore, such a method of determining cancer-specific
serum proteins is disadvantageous in terms of low accuracy in
diagnosis, as well as being uneconomical.

DISCLOSURE OF THE INVENTION

Aiming to overcome the above-mentioned problems, the present inventors conducted
intensive and thorough research into a method of screening diseases which is simple and
quick and which can be performed by ordinary persons. This research, leading to the present
invention, resulted in the finding that a disease-specific serum proteome standard can be
obtained by transforming separated states of serum proteins on 2D-gels into images, in order
to facilitate distinction of modified proteins through separation of proteins contained in a
serum sample in two dimensions, and mining the 2D-gel images using an image mining
technique, and that such standards are useful in developing a simple and economical method
and system of screening and classifying some specific types of cancer.

It is therefore an object of the present invention to provide a method of analyzing
cancer using a proteome image mining technique, which facilitates early cancer detection, by
collecting a plurality of serum proteomes from normal individuals and diseased individuals and
transforming 2D-gel patterns of the serum proteomes into two-dimensional images, producing
serum proteome standards using an image mining technique and constructing a database
consisting of the proteome standards, obtaining a 2D-gel image of the serum proteome from a
subject organism, and comparing the image of the subject with a plurality of the serum
proteome standards stored in the database.

It is another object of the present invention to provide a method of finding
characteristic patterns of serum proteomes from diseased individuals, and distinguishing them
from those of normal individuals, by applying an image-mining tool to two-dimensional
images of serum proteomes.

It is still another object of the present invention to provide a method of
analyzing cancer using a proteome image-mining tool, which makes it possible to obtain
precise analysis results by analyzing serum proteomes using the image-mining tool
employing a genetic algorithm and a support vector machine, and to follow up the
progress and prognosis of disease states by a fuzzy rule-based classification step.

It is a further object of the present invention to provide cancer-specific screening
biomarkers, that is, proteome patterns, which have a great influence on cancer detection
when such a method or system is applied.



BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects, features and other advantages of the present
invention will be more clearly understood from the following detailed description taken in
conjunction with the accompanying drawings, in which:

Fig. 1 is a block diagram of a system for cancer analysis according to the present
invention;

Fig. 2 is a detailed block diagram of a proteome standard production means shown in
Fig. 1;

Fig. 3 is a flowchart illustrating a method of cancer analysis according to the present
invention;

Fig. 4 is a flowchart illustrating a method of producing the proteome standard shown
in Fig. 3;

Fig. 5 is a photograph illustrating a two-dimensional image of a serum proteome;

Fig. 6 shows a process of producing a proteome standard after the input of serum
proteome;

Fig. 7 shows an optimal parting plane determined by a support vector machine;

Fig. 8 shows a training step of breast cancer detection using a support vector
machine and a genetic algorithm;

Fig. 9 shows a testing step of breast cancer detection using a support vector machine
and a genetic algorithm;

Fig. 10 shows a result of practical diagnosis of breast cancer in which 26 spots are
used as the optimal feature data; and

Fig. 11 shows a result of practical usage of breast cancer screening in which 48 spots
are used as the optimal feature data.

BEST MODES FOR CARRYING OUT THE INVENTION

The present invention is directed to a method of analyzing cancer, comprising the
steps of: transforming inputted serum proteomes from normal individuals and individuals
having cancer into two-dimensional images, extracting feature data from the images,
generating a proteome standard having a disease-specific proteome pattern by computing
optimal features capable of distinguishing the two kinds of serum proteome from each of the
feature data, and constructing a database consisting of the proteome standard; inputting a
serum proteome from a subject of interest, transforming the serum proteome into a
two-dimensional image and extracting feature data from the image; and comparing the structure
of the serum proteome pattern of the subject with the proteome standard having a
disease-specific proteome pattern and determining whether the serum proteome of the subject is
normal or abnormal, that is, indicating the possible existence of cancer, or discriminating the
type of cancer, based on the comparison results.

In addition, the present invention provides a system of diagnostic screening of cancer,
comprising an input means for inputting serum proteome; a proteome standard production
means for generating a proteome standard having a disease-specific proteome pattern by
transforming received serum proteomes from normal individuals and individuals having cancer
into two-dimensional images and extracting features from the images, and extracting optimal
features capable of distinguishing the two kinds of serum proteome from each of the feature
data, and for transforming a serum proteome of a subject into a two-dimensional image and
extracting features from the image; a proteome comparison means for mapping the serum
proteome pattern of the subject, extracted by the proteome standard production means, with
the proteome standard pattern to determine similarities between the two patterns; a disease
analysis means for estimating the serum proteome of the subject as 'normal' if the serum
proteome pattern of the subject is similar to that of the normal individuals, and otherwise, as
'having cancer', based on the mapping results by the proteome comparison means; and an
output means for outputting the analysis results by the disease analysis means.
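
The comparison and analysis means described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the specification does not fix a similarity measure, so cosine similarity between spot-intensity vectors is assumed here, and the names `cosine_similarity`, `analyze_subject`, `NORMAL_STANDARD` and `CANCER_STANDARD`, as well as the numeric values, are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two equally long spot-intensity vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def analyze_subject(subject, normal_standard, cancer_standard):
    """Map the subject pattern onto both standards and return the estimate
    together with the two similarity scores."""
    sim_normal = cosine_similarity(subject, normal_standard)
    sim_cancer = cosine_similarity(subject, cancer_standard)
    estimate = "normal" if sim_normal >= sim_cancer else "having cancer"
    return estimate, sim_normal, sim_cancer

# Hypothetical standards: relative intensities of three spots (made-up values).
NORMAL_STANDARD = [1.0, 0.2, 0.1]
CANCER_STANDARD = [0.2, 1.0, 0.8]
```

Calling `analyze_subject([0.9, 0.2, 0.1], NORMAL_STANDARD, CANCER_STANDARD)` yields the estimate 'normal', because that subject vector points in nearly the same direction as the normal standard. In the system described here, the decision would rest on the trained estimation function rather than on this simple comparison.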






Definition of terms 

If not defined elsewhere, technical and scientific terms used in the present 
specification have the meanings commonly understood by those skilled in the art. 
The term "biomarker", as used herein, refers to a polypeptide differentially present in
serum samples from individuals having any disease, compared to that from normal
individuals. Such a biomarker or biomarkers may comprise a single polypeptide or two or
more polypeptides. In addition, the term "differentially present" means that a specific
polypeptide in a serum sample from an individual having any disease has an increased or
reduced expression level, or is newly present or absent, compared to a serum sample from a
normal individual.

The term "proteome pattern", as used herein, means a characteristic group or
grouped form of polypeptides differentially present in a serum sample from an individual
having any disease, compared to a serum sample from a normal individual. Typical
examples of the proteome pattern include a group of serum proteins showing specific
[...]
two dimensions. In addition, the term "disease-specific proteome pattern", as used herein,
refers to a group of serum proteins specifically appearing according to the kinds or types of
diseases, or a grouped form of the serum proteins. Such a proteome pattern is used as a
marker to detect diseases and identify the kinds or types of diseases using the method and
system according to the present invention.

The term "feature data", as used herein, refers to the data of a serum proteome
capable of distinguishing diseased states through comparison of serum proteomes from
normal and diseased individuals. In detail, the feature data includes data of spots
corresponding to serum proteins specifically present on two-dimensional images of serum
proteomes from diseased individuals. For example, the feature data may include a group
(combination) of spots, the mass of each of the spots, and/or the isoelectric point of each spot. In
addition, the term "optimal feature data", as used herein, refers to optimal data capable of
specifically distinguishing diseases among the feature data. In detail, the optimal feature
data includes optimal combinations among combinations of disease-specific spots.
The term "data mining", as used herein, is a process of discovering useful
correlations hidden in a large volume of data, and refers to a process of identifying new data
models derived from the data of the databases, which are previously unknown, and of
extracting practicable information in the future and using the information for estimation.
That is, "data mining" means to discover valuable information by finding patterns and
relations hidden in the data.

The term "genetic algorithm (GA)", as used herein, deals with the ability of
living organisms to adapt to their environment by technologically modeling mechanisms
associated with heredity and evolution of living organisms, and refers to a technique of
generating much better solutions by expressing possible solutions for problems as a data
structure having a predetermined form and then gradually modifying the data structure. In
more detail, the genetic algorithm searches
at a high speed to derive a maximum or minimum value of a function f(x) for a variable x
defined within a certain range. The genetic algorithm typically comprises the steps of
determining genetic types by performing coding work of transforming gene elements into
symbol strings; determining an initial genetic group by generating a variety of individuals
having different genetic elements from the genetic types determined at the step of determining
genetic types; evaluating adaptability of individuals by computing adaptability of each
individual by a predetermined method; determining survival distribution of individuals based
on the adaptability determined at the step of evaluating adaptability; mating by exchanging
genes between two chromosomes to generate new individuals; inducing mutagenesis by
forcibly changing a portion of genes and thus maximizing diversity of a genetic group to
generate individuals having much better solutions; and returning to the step of evaluating
adaptability of each individual. Since the genetic algorithm finds solutions through mutual
cooperation between a plurality of individuals by gene manipulation such as selection or
mating, much better solutions are easily discovered. Also, the genetic algorithm has an
advantage in that its operation is easy.
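
The steps listed above (coding, initial group, adaptability evaluation, survival, mating, mutagenesis, repetition) can be illustrated with a minimal sketch that maximizes a function f(x) over a range. The sketch assumes a binary coding, tournament selection, one-point mating and elitism; these particular choices, the parameter defaults and the name `genetic_maximize` are illustrative assumptions and are not taken from the specification.

```python
import random

def genetic_maximize(f, lo, hi, bits=16, pop_size=30, generations=60,
                     crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximize f(x) for x in [lo, hi] with a simple binary-coded GA."""
    rng = random.Random(seed)

    def decode(chrom):
        # Coding: a bit string (symbol string) maps to a real value in [lo, hi].
        value = int("".join(map(str, chrom)), 2)
        return lo + (hi - lo) * value / (2 ** bits - 1)

    # Initial genetic group: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(bits)]
                  for _ in range(pop_size)]
    best = max(population, key=lambda c: f(decode(c)))

    for _ in range(generations):
        # Evaluate the adaptability (fitness) of each individual.
        fitness = [f(decode(c)) for c in population]

        def select():
            # Survival: binary tournament (assumed selection scheme).
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if fitness[a] >= fitness[b] else population[b]

        children = [best[:]]  # elitism: the best solution so far always survives
        while len(children) < pop_size:
            p1, p2 = select()[:], select()[:]
            if rng.random() < crossover_rate:
                # Mating: exchange genes after a random cut point.
                cut = rng.randrange(1, bits)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                # Mutagenesis: flip a small fraction of the genes.
                for i in range(bits):
                    if rng.random() < mutation_rate:
                        child[i] ^= 1
                children.append(child)
        population = children[:pop_size]
        best = max(population + [best], key=lambda c: f(decode(c)))

    return decode(best)
```

For example, `genetic_maximize(lambda x: -(x - 2.0) ** 2, 0.0, 5.0)` searches for the maximum of a parabola whose peak lies at x = 2. Because the best individual is carried over unchanged, the best adaptability never decreases from one generation to the next.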

The term "support vector machine (SVM)", as used herein, is a universal
learning machine useful for pattern recognition, whose decision surface is parameterized by a
set of support vectors and a set of corresponding weights, and refers to a method of
processing a plurality of variables simultaneously rather than separately. Thus, the support vector
machine is useful as a statistical tool for text classification. The support vector machine
non-linearly maps its n-dimensional input space into a high-dimensional feature space, and
presents an optimal interface (optimal parting plane) between features. The support vector
machine comprises two phases: a training phase and a testing phase. In the training phase,
support vectors are produced, while estimation is performed according to a specific rule in the
testing phase.
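
The testing phase of such a machine can be sketched as follows. The sketch only evaluates a decision surface that is parameterized by support vectors and corresponding weights; the training phase that would produce them is omitted. The Gaussian (RBF) kernel is assumed as one common non-linear mapping, and the support vectors, weights and bias below are hypothetical values, not the output of any real training.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel, assumed here as the non-linear mapping."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(x, support_vectors, weights, bias, kernel=rbf_kernel):
    """Signed decision value: sum_i w_i * K(s_i, x) + b.
    Each weight w_i already includes the class label of its support vector."""
    return sum(w * kernel(s, x) for s, w in zip(support_vectors, weights)) + bias

def svm_predict(x, support_vectors, weights, bias):
    """Estimate the class (+1 or -1) from the side of the parting plane."""
    return 1 if svm_decision(x, support_vectors, weights, bias) >= 0 else -1

# Hypothetical values standing in for the result of a training phase.
SUPPORT_VECTORS = [(0.0, 0.0), (2.0, 2.0)]
WEIGHTS = [-1.0, 1.0]
BIAS = 0.0
```

With these made-up parameters, a point close to (2, 2) is assigned to class +1 and a point close to (0, 0) to class -1, since the kernel value decays with distance from each support vector.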

The method for disease analysis and the system using the method according to the present
invention will be described in detail with reference to the accompanying drawings. The
method and system for disease analysis are useful for diagnosis of a variety of diseases, but in
the present invention, their application to cancer diagnosis is illustrated.

Samples useful for standard generation and disease analysis include biological
samples which may contain disease-specific polypeptides, which are exemplified by serum,
urine, tears and saliva. In particular, serum proteomes from all individuals having genes may be
used as biological samples, but in the present invention, serum proteomes from humans are
illustrated.

In the present invention, cancer means a pathogenic state caused by "uncontrolled
cell growth". Examples of cancer include breast cancer, ovarian cancer, stomach cancer,
liver cancer, uterine cancer, lung cancer, large intestine cancer, pancreatic cancer and prostate
cancer.



System for disease analysis using an image mining technique

WO 03/102589
PCT/KR02/02427



Fig. 1 is a block diagram of a system of analyzing cancer according to the present 
invention. 

As shown in Fig. 1, a cancer analysis system 10 comprises a proteome standard
production means 102, a proteome comparison means 104, a disease analysis means 106, an
input/output interface 108, a controlling means 110, an input means 112, an output means 114
and a database 116.

The proteome standard production means 102 receives serum proteomes from N
(e.g., 20) normal individuals and N (e.g., 20) diseased individuals
through the input means 112, transforms the serum proteomes into two-dimensional images
(see Fig. 5), extracts features, namely, specific spots, and distinguishes optimal feature data
from the extracted feature data, while extracting and normalizing correlations between data
consisting of [...] the database
116. To perform these functions of the proteome standard production means 102, a
genetic algorithm, a support vector machine and a fuzzy rule-based classification system are
available, which will be described in detail below. Through the operation of the
proteome standard production means, features (intensity, size, etc.) of serum proteomes from
individuals having cancer that differ from a serum proteome standard of normal individuals
are discovered, and, in particular, use of a fuzzy rule-based classification system allows
the progression status and future prognosis of cancer and other diseases to be
monitored.

In addition, the proteome standard production means 102 transforms the serum
proteome of a subject of interest, as well as those of the normal and diseased individuals used
as standards, into two-dimensional images and extracts feature data from the images, and the
resulting feature data are used in a process of analyzing whether the subject has a specific
disease or not.

After the feature data of the serum proteome of a subject are extracted by the
proteome standard production means 102, the proteome comparison means 104 determines
similarities between the serum proteome pattern of the subject and the proteome standard
patterns. When the pattern of the serum proteome of the subject is similar to that of the serum
proteome of normal individuals, the disease analysis means 106 determines that the subject is
normal, and otherwise, that the subject has cancer. The estimation
results are outputted by the output means 114. Herein, when the subject is estimated as
having cancer, the progression states and prognosis of the cancer are predicted, and the
results are outputted. Also, in case that the subject does not have cancer at present, the
probability of future cancer development can be predicted and outputted. To perform such
functions, the proteome standard production means 102 should produce standard data using
a fuzzy rule-based classification system.

The input/output interface 108 is for connecting [...]
operation of each functional means as described above.

The cancer analysis system 10 according to the present invention may further [...]
of normal individuals and individuals having cancer, who donate their serum proteome to be
used as standards, and of personal information of subjects in the database 116 in a coded form.

Fig. 2 is a detailed block diagram of the proteome standard production means 102
shown in Fig. 1.

The proteome standard production means 102 includes a pre-processing means 210
for obtaining meaningful feature data from the two-dimensional images of serum proteome;
an evolutionary classification means 220 for identifying normality of the serum proteome of a
subject from the feature data obtained by the pre-processing means 210; and a fuzzy
rule-based classification means 230 for estimation of more detailed states of the serum proteome
of a subject from the feature data obtained by the pre-processing means 210, employing
experimental knowledge, statistical tools, etc.

The pre-processing means 210 includes an image processing means 212 and a
feature extraction means 214. The image processing means 212 performs general image
processing works, including noise filtering, image enhancement, ortho-projection, edge
detection and optimal thresholding, on the inputted two-dimensional images, while the
feature extraction means 214 extracts basic features, namely, disease-specific spots, from the
image-processed two-dimensional images. Each feature extracted by the feature extraction
means 214 is discriminated or labeled, thus producing feature data for spots.
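
A minimal sketch of the thresholding and labeling part of this pre-processing is given below. It assumes that the gel image is a grid of numbers in which larger values mean more protein, and that a spot is a group of 4-connected pixels at or above a threshold; noise filtering, enhancement and edge detection are left out. The function name `extract_spots` and the example grid `GEL` are hypothetical.

```python
def extract_spots(image, threshold):
    """Label connected regions at or above the threshold and return one
    feature record (label, size, intensity, centroid) per spot."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    spots = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < threshold or seen[r][c]:
                continue
            # Flood fill over 4-connected neighbours to collect one spot.
            stack, pixels = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            spots.append({
                "label": len(spots) + 1,
                "size": len(pixels),
                "intensity": sum(image[y][x] for y, x in pixels),
                "centroid": (sum(y for y, _ in pixels) / len(pixels),
                             sum(x for _, x in pixels) / len(pixels)),
            })
    return spots

# Hypothetical 5 x 5 image containing two separate spots.
GEL = [
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 7, 0, 0, 0],
    [0, 0, 0, 0, 6],
    [0, 0, 0, 0, 5],
]
```

With a threshold of 5, `extract_spots(GEL, 5)` returns two labeled spots, the first covering three pixels and the second covering two. The size and intensity fields correspond to the kind of spot features (intensity, size, etc.) mentioned in this description.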

The evolutionary classification means 220, which is a means for analyzing patterns
of serum proteomes from normal or diseased individuals using the data obtained by the
pre-processing means 210, includes a GA processing means 222 and a SVM application
means 224 [...]
among combinations of disease-specific spots. The GA processing means 222 discriminates
optimal features playing a critical role in [...]
extracted by the pre-processing means 210, while the SVM application means
224 estimates fidelity of the optimal feature data discriminated by the GA processing means
222 using decision functions and classification errors. Thus, possible alternatives for spots of the next
generation are produced, and through such an evolution method, optimal feature data and
estimation functions according to the data can be generated. Herein, the estimation function
used by the SVM application means 224 is a predetermined function.

In accordance with the present invention, features of 5 to 20 information pieces can be
effectively extracted using a genetic algorithm. The evolutionary classification means 220
extracts a plurality of features capable of easily and effectively distinguishing cancer
diseases. Later, by comparing features extracted from a test sample from a subject with a
plurality of features as described above, whether the subject has cancer can be determined.
[...] means 234 arranges and normalizes the matrix obtained by the data mapping means 232,
thereby generating a final rule base. The fuzzy rule-based [...]
monitoring of progression and prognosis of [...]
experimental methods by an expert system, as well as simple detection of cancer.
The method for disease analysis according to the present invention will be described
below. The method
comprises the steps of: generating a proteome standard having a disease-specific proteome
pattern and constructing a database consisting of the proteome standard (training step); and
estimating whether a serum proteome of the subject is normal or indicative of a specific
[...] serum proteome of a subject of interest and comparing
the feature data of the subject with the disease-specific proteome [...]. In
detail, the method for cancer analysis may be performed by a program stored in the memory
[...].
In addition, a program composed of instruction words executable by a digital processing
device is [...] realized, and [...] the
evolutionary classification step and the fuzzy rule-based classification step according to the
present invention [...]
device. Moreover, [...]
can be realized in the form of software, an FPGA, an ASIC, etc.

Production of a proteome standard having a disease-specific proteome pattern
(training step)

Fig. 3 is a flowchart illustrating a method of cancer analysis according to the
present invention, while Fig. 4 is a flowchart illustrating a method of producing the proteome
standard shown in Fig. 3.

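A disease-specific proteome standard of the present invention is produced by
comparing proteome [...] normal [...]
individuals and determining a disease-specific proteome [...]
serum proteomes [...]
for support vector machine [...].

Data mining is a process of discovering useful correlations hidden in a large volume of
data, and refers to a process of identifying new
data models derived from the data of the databases, which are previously unknown, and
of extracting practicable information in the future and using the information for estimation.
That is, data mining means to discover valuable information by finding patterns and relations hidden
in the data. Data mining can be applied for image analysis, which is a tool to extract
patterns from digital pictures and is used in diverse fields including recognition of
characters, medical diagnostics and the defense industry.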

A method of discovering disease-specific proteome patterns includes the steps of
receiving serum proteomes from normal and diseased individuals by the input means 112
(S101) and transforming each of the serum proteomes into a two-dimensional image by the
proteome standard production means
102, extracting disease-specific features (especially, [...]
feature data among the extracted feature data [...]
in a database. A process of producing a proteome standard will be
described in more detail with reference to Figs. 4 to 6, as follows.

The analysis of proteomes in the present invention may be performed by the
conventional methods known in the art, [...] by the 2D-gel
method. The 2D-gel method [...]
proteins in [...], in which proteins are [...]
charges (isoelectric focusing, IEF), and then separated by [...]
(SDS-PAGE). In the present invention, serum proteins from normal individuals and serum
proteins from diseased individuals are separated on 2D-gels.
Fig. 5 is a photograph illustrating a two-dimensional image of a serum proteome.
As shown in Fig. 5, the
separated pattern of serum proteins [...]
to process the protein pattern on a 2D-gel into an analyzable form. Then, disease-specific
features are extracted from the image information of the serum proteome, transformed into a
digital information format, and stored. That is, specific features common to the
two-dimensional image information of serum proteomes from a plurality of [...]
individuals are extracted, and each data item (coordinate, molecular weight, isoelectric point,
etc.) of the features is stored to construct a database. For example, a database may be
generated by storing information (coordinate, molecular weight, isoelectric point, etc.) of
[...]
images of serum proteomes from individuals having a specific disease with two-dimensional
images of serum proteomes from normal individuals. In addition, a database may be
constructed by extracting common features between the two-dimensional image information of
serum proteomes from a plurality of diseased individuals and storing each data item
(coordinate, molecular weight, isoelectric point, etc.) of the features.
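
The extraction of features common to several gel images can be sketched as below. The record fields follow the data items named above (coordinate, molecular weight, isoelectric point), but the class name `Spot`, the functions `same_spot` and `common_spots`, the matching tolerances and all numeric values are hypothetical, and matching spots by molecular weight and isoelectric point within a tolerance is an assumption made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Spot:
    x: float        # coordinate on the image
    y: float
    mw_kda: float   # molecular weight in kDa
    pi: float       # isoelectric point

def same_spot(a, b, mw_tol=1.0, pi_tol=0.1):
    """Treat two spots as the same protein if both values agree within tolerance."""
    return abs(a.mw_kda - b.mw_kda) <= mw_tol and abs(a.pi - b.pi) <= pi_tol

def common_spots(gels, mw_tol=1.0, pi_tol=0.1):
    """Return the spots of the first gel that have a match in every other gel."""
    reference, others = gels[0], gels[1:]
    return [s for s in reference
            if all(any(same_spot(s, t, mw_tol, pi_tol) for t in gel)
                   for gel in others)]

# Hypothetical spot lists for three individuals (made-up values).
GEL_A = [Spot(10, 20, 66.0, 5.9), Spot(30, 40, 25.0, 7.1)]
GEL_B = [Spot(11, 21, 66.4, 5.95), Spot(50, 60, 14.0, 4.5)]
GEL_C = [Spot(9, 19, 65.7, 5.85)]
```

Here `common_spots([GEL_A, GEL_B, GEL_C])` keeps only the first spot of `GEL_A`, since it is the only one with a counterpart in both other gels. The returned records are what would be stored as data items in the database.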
The disease-specific features mean [...] having [...]
and size among spots [...] by [...]
molecular weight.

For example, [...]
a large number of proteins are evaluated, and [...]
proteins. Among the [...], some proteins are extracted as cancer biomarkers capable of
effectively detecting a specific cancer, producing [...].

When analyzing a plurality of serum proteomes from breast [...],
[...] spots are selected, which show features specific to breast cancer. The specific spots
are listed in Table 1, below, in which [...] isoelectric [...] are
indicated.

Optimal feature data [...] data from spots [...], while correlations
between data from spots in images [...] to
construct a database. For this, a genetic algorithm, a support vector machine (SVM) and a
fuzzy rule-based classification [...]
become proteome patterns of individuals having a specific disease, distinguishable from those
of normal individuals, thus generating disease-specific proteome patterns.

In detail, each of various combinations of spots listed in Table 1 may give a
breast cancer-specific proteome pattern [...]
spots selected from the spots listed in Table 1 [...] used as a breast cancer-specific pattern in
diagnostic screening of breast cancer. Herein, to select one or more [...] one
or more combinations of the total 67 spots, that is,
TABLE 1

No. of spot    Molecular weight (kDa)

With reference to Fig. 4, to obtain meaningful feature data from two-dimensional
images of serum proteomes, [...] a
pre-processing step (S201) and an evolutionary classification step (S202-S204). To enable
monitoring of the progression states [...] by [...]
experimental methods as well as simple detection of cancer, the production of the
proteome standard may further comprise a fuzzy rule-based classification step.

In addition, the pre-processing step (S201) includes the steps of processing images
and extracting features. At the image processing step, general image processing works,
including noise filtering [...], are performed. At the feature extraction step, basic
features of spots are extracted from the image-processed two-dimensional images by the
feature extraction means [...], and each of the features extracted is labeled [...], thus
producing feature data for spots.
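
The pre-processing step described above can be illustrated with a minimal sketch. It is
not the disclosed implementation: it assumes a gel image held as a list of rows of
intensities, uses a 3x3 median filter for the noise filtering, uses a fixed threshold in
place of optimal thresholding, and omits image enhancement, ortho-projection and edge
detection. The function names and the toy image are hypothetical.

```python
def median_filter3(img):
    """3x3 median filter; border pixels are copied unchanged (assumption)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = sorted(img[r + dr][c + dc]
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1))
            out[r][c] = window[4]  # middle of the 9 sorted values
    return out


def extract_spots(img, threshold):
    """Label 4-connected regions at or above threshold and describe each as a spot."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    spots = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or img[r][c] < threshold:
                continue
            seen[r][c] = True
            stack, pixels = [(r, c)], []
            while stack:  # flood fill of one spot
                pr, pc = stack.pop()
                pixels.append((pr, pc))
                for nr, nc in ((pr + 1, pc), (pr - 1, pc), (pr, pc + 1), (pr, pc - 1)):
                    if (0 <= nr < h and 0 <= nc < w
                            and not seen[nr][nc] and img[nr][nc] >= threshold):
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            area = len(pixels)
            spots.append({
                "area": area,
                "volume": sum(img[pr][pc] for pr, pc in pixels),  # summed intensity
                "centroid": (sum(p[0] for p in pixels) / area,
                             sum(p[1] for p in pixels) / area),
            })
    return spots


# Toy 7x7 image: one 3x3 spot plus one isolated noise pixel (invented data).
image = [[0] * 7 for _ in range(7)]
for r in range(1, 4):
    for c in range(1, 4):
        image[r][c] = 9
image[5][5] = 9

spots = extract_spots(median_filter3(image), threshold=5)
```

In this toy case the median filter removes the isolated pixel and trims the corners of the
square spot, so a single labeled spot remains. Each dictionary plays the role of one
labeled feature record for a spot.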

At the evolutionary classification step, patterns of serum proteomes from normal or
diseased individuals are classified using the data obtained by the pre-processing step [...]
a SVM (support vector machine) application step (S203), as well as a step (S204) of
extracting optimal feature data and estimation functions [...] discriminated at the GA
processing step [...]. At the GA processing step (S202), spots having optimal features
playing a critical [...] are classified [...] means 222 at the pre-processing step. At the
SVM application step (S203), fidelity of the optimal feature data discriminated at the GA
processing step is estimated by the SVM application means 224 using decision functions and
classification error [...] alternative for spots of the next generation is produced [...]
method, optimal feature data and estimation functions according to the data can be
generated. Herein, the [...] functions used by the SVM application means 224 are
predetermined functions.

Fig. 6 shows a process of producing a proteome standard from the pre-processing
step to the evolutionary classification step. Using [...] (100) of diseased individuals and
a collection of serum proteome images (200) of normal individuals, image pre-processing is
performed (300). Disease-[...] extracted from the proteome images of diseased and normal
individuals, and disease-specific spots are [...] according to their intensity [...]
created (400). A plurality of features extracted as described above [...] five or more
feature-spots, and preferably, [...] spots. Feature data of disease-specific spots of the
first generation are applied to a support vector machine (500), [...] feature data and
estimation functions according to [...]. In addition, for serum proteome images in the
second and N generations generated by inducing mating [...] by a genetic algorithm, the same
process as in the first generation is [...] (600 and 700), thus giving final optimal
features and estimation functions.
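
The generation-by-generation search for optimal spots can be sketched as follows. This is
an illustrative sketch under stated assumptions, not the disclosed algorithm: a
nearest-centroid classifier stands in for the support vector machine so that the example
has no external dependencies, the fitness penalty on the number of spots is an assumption,
and all function names, parameters and data are hypothetical.

```python
import random


def centroid_accuracy(X, y, mask):
    """Training accuracy of a nearest-centroid classifier on the selected spots."""
    idx = [j for j, keep in enumerate(mask) if keep]
    if not idx:
        return 0.0
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, yi in zip(X, y) if yi == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in idx]
    correct = 0
    for x, yi in zip(X, y):
        dist = {lab: sum((x[j] - c[k]) ** 2 for k, j in enumerate(idx))
                for lab, c in centroids.items()}
        correct += min(dist, key=dist.get) == yi
    return correct / len(X)


def fitness(X, y, mask):
    # Assumed trade-off: reward accuracy, lightly penalise the number of spots.
    return centroid_accuracy(X, y, mask) - 0.01 * sum(mask)


def select_spots(X, y, generations=20, pop_size=12, seed=0):
    """Evolve binary masks over spots; 1 means the spot is kept as a feature."""
    rng = random.Random(seed)
    n = len(X[0])
    population = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                              for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(X, y, m), reverse=True)
        parents = ranked[: pop_size // 2]
        children = [ranked[0]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover ("mating")
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:  # occasional single-bit mutation
                flip = rng.randrange(n)
                child[flip] = 1 - child[flip]
            children.append(child)
        population = children
    return max(population, key=lambda m: fitness(X, y, m))


# Invented spot intensities: columns 0 and 2 differ between the two groups.
X = [[0.9, 0.4, 0.8, 0.5], [0.8, 0.6, 0.9, 0.4], [1.0, 0.5, 0.7, 0.6],  # cancer
     [0.1, 0.5, 0.2, 0.5], [0.2, 0.4, 0.1, 0.6], [0.0, 0.6, 0.3, 0.4]]  # normal
y = [1, 1, 1, 0, 0, 0]
best_mask = select_spots(X, y)
```

Because the initial population contains the all-spots mask and the best mask is carried
over unchanged each generation, the returned mask never scores worse than using every spot.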

Fig. 7 shows an optimal parting plane obtained by a support vector machine, in
which an optimal interface, namely, an optimal parting plane, is drawn by correlations among
features from spot 1, spot 2 [...], wherein the [...] proteome images.



A fuzzy rule-based classification step, which [...] pattern recognition accuracy by
extracting [...] which can be easily missed in the evolutionary classification step [...] by
statistical and experimental methods, comprises a data mapping step (S205), a rule-based
classification step (S206) and a step of producing a rule base (S207) based on the two steps.
At the data mapping step (S205), correlations between spots from two-dimensional images of
serum proteome are [...] by a [...], the [...] classified by a statistical technique, and
statistical inaccuracy is quantified using a fuzzy technique. At the rule-based
classification step (S206), the results obtained by the data mapping are arranged and
normalized by a rule-based classification means 234, thereby generating a final rule base
(S207). The fuzzy rule-based classification [...] of the present invention, but its
application [...] progression and prognosis of diseases through statistical and experimental
methods by an
expert system, as well as simple detection of cancer.

The process of producing a proteome standard according to the present invention,
comprising the steps of extracting features (disease-specific spots) from image information
of serum proteomes from N (e.g., 20) normal individuals and N (e.g., 20) diseased
individuals, and then producing a proteome standard by computing optimal features from the
feature data, may further include a step of estimating more detailed information from the
two-dimensional images of the serum proteome of a subject individual by employing
experimental data, a statistical method, etc.

Fig. 8 shows an application of a process of producing a proteome standard having a
disease-specific proteome pattern to diagnostic screening of breast cancer (training
step). As shown in Fig. 8, through analysis of two-dimensional images of serum proteomes
from 30 normal individuals and 30 individuals having a breast (specific) cancer,
information [...] searched using a support vector machine and a genetic algorithm.



Estimation of development of a specific disease through comparison of the
proteome of the subject with a disease-specific proteome standard (testing step: S2)

After producing a standard by analyzing serum proteomes from normal and diseased
individuals, as described above, a serum proteome of a subject of interest is inputted by
the input means 112 (S103), and feature data are then extracted by the proteome standard
production means 102 (S104). In more detail, the serum proteome of a subject is separated
on a 2D-gel according to the same method as in the image pre-processing step for production
of a proteome standard, and the resulting 2D-gel image is transformed into a digital
information format. Basic image processing works, including noise filtering, image
enhancement, ortho-projection and edge detection, are performed for the two-dimensional
images of a subject, and specific data as proteome patterns are then extracted. The
resulting proteome patterns are used for comparison with the disease-specific proteome
standard.

Estimation of whether a subject of interest has cancer through comparison of the
proteome of the subject with a disease-specific proteome standard is achieved by performing
a step of comparing the two proteomes (S105) and a step of determining whether the subject
has cancer or not (S106). The results of analysis of cancers are displayed by the output
means 114 (S107).

At the step of comparing proteomes (S105), the structure of a serum proteome
pattern from a subject of interest is compared with the disease-specific proteome standard
stored in the database 116 by the proteome comparison means 104, and whether the serum
proteome of the subject is normal or abnormal is analyzed by the disease analysis means 106.
When more detailed states of serum proteome of the subject are stored at the training step
using a fuzzy rule-based classification means employing experimental knowledge, a [...]
subject can be determined.

A pattern matching step is performed to screen the cancer, which may further
comprise a fine classification step in the case that a fuzzy rule-based classification means
is applied at the training step. At the pattern matching step, classification into "normal"
or "having a disease" is performed using a support vector machine by applying features and
estimation functions, extracted upon producing the proteome standard, to the pre-processed
serum proteome of a subject of interest. In addition, at the fine classification step, fine
information including correlations between spots is deduced by projecting the pre-processed
serum proteome of a subject to a rule base produced at the fuzzy rule-based classification
step.
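
The pattern matching step can be pictured with a short sketch. It is an illustration only
and assumes a linear estimation function; the layout of the stored standard (the keys
spot_indices, weights and bias), the function name match_pattern and every number below
are hypothetical and do not come from the disclosure.

```python
def match_pattern(standard, spot_features):
    """Apply a stored standard (selected spots + linear estimation function)."""
    selected = [spot_features[i] for i in standard["spot_indices"]]
    score = sum(w * x for w, x in zip(standard["weights"], selected)) + standard["bias"]
    label = "having a disease" if score >= 0 else "normal"
    return label, score


# Hypothetical standard as it might be read back from the database.
standard = {"spot_indices": [0, 2], "weights": [1.0, 1.0], "bias": -1.0}

label_a, score_a = match_pattern(standard, [0.9, 0.5, 0.8, 0.5])
label_b, score_b = match_pattern(standard, [0.1, 0.5, 0.2, 0.5])
```

Only the spots selected when the standard was produced enter the score, which mirrors the
way the extracted features and estimation functions are reused on the subject's
pre-processed proteome.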

The support vector machine (SVM), as defined above, comprises two steps: a
training step and a testing step. At the training step, data vectors are inputted from a
training set. In the present invention, the step of inputting results of pre-processing of
serum proteomes from N normal individuals and N individuals having cancer corresponds to the
training step. Then, the input data vectors from the training set are transformed into a
multi-dimensional space, and parameters for support vectors and weights are determined. At
the testing step, data vectors are inputted from a testing set, and the input vectors from
the testing set are transformed into a multi-dimensional space by data matching. Then, a
classification signal is produced from an optimal parting plane representing states of each
input data vector. That is, whether the input data vectors from the testing set are normal
or abnormal is determined.
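
The two SVM steps can be sketched in a few lines. This is a minimal sketch, assuming a
linear SVM trained by full-batch subgradient descent on the regularised hinge loss; the
text does not say which kernel or solver is used, so that choice, the hyperparameters, the
function names and the toy feature vectors (assumed to be standardised spot intensities)
are all assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Training step: find weights w and bias b of the parting plane w.x + b = 0.

    Labels are +1 (having cancer) and -1 (normal).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss is active for this vector
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Testing step: the side of the parting plane gives the classification signal."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return "abnormal" if score >= 0 else "normal"


# Invented, standardised two-spot feature vectors for the training set.
X_train = [[-1.0, -0.8], [-0.9, -1.1], [-1.2, -1.0],   # normal
           [1.0, 0.9], [0.8, 1.1], [1.1, 1.2]]         # having cancer
y_train = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X_train, y_train)
test_labels = [classify(w, b, x) for x in ([0.9, 1.0], [-1.0, -0.9])]
```

A kernel SVM would replace the dot product with a kernel evaluation against the support
vectors; the linear form is used here only to keep the sketch short.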

Fig. 9 shows a practical application of the step of estimating whether a subject has a
disease through comparison of the proteome of the subject with a disease-specific proteome
standard (testing step: S2). Based on decision models for breast cancer, produced at the
training step (S1), a test set consisting of 33 cancer patients and 35 normal individuals
was tested.



In a preferred embodiment, [...] interest and analysis results are stored in the database
116, which are useful for later analysis of other proteomes.

In the following example, the system and method for disease analysis according to
the present invention are applied to practical cancer screening.

EXAMPLE 1 






After training on two-dimensional images of serum proteomes from 30 breast cancer
patients and 30 normal individuals, a test was performed on 33 cancer patients and 35
normal individuals. This test through analysis of serum proteomes was found to have an
accuracy of 94.11%, a sensitivity of 100% and a specificity of 88.57%. In this test, 26
spots were used as optimal feature data of breast cancer. The 26 spots were selected from
the 67 breast cancer-specific spots listed in Table 1, above. The results are given in more
detail in Fig. 10, in which accuracy means a degree of correctly estimating real breast
cancer, sensitivity means the rate of correctly identifying positives, and specificity means
a degree of distinguishing breast cancer from other cancer diseases.
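
The reported rates can be checked against the size of the test set. The sketch below is an
illustration, not part of the disclosed system. It assumes the usual confusion-matrix
definitions, with specificity taken as the fraction of normal individuals classified as
normal, and it assumes counts of 33 true positives, 0 false negatives, 31 true negatives
and 4 false positives. These counts are inferred from the reported percentages and are not
stated in the text; the function name screening_metrics is hypothetical.

```python
def screening_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Assumed counts: 33 cancer patients all detected, 31 of 35 normal individuals cleared.
metrics = screening_metrics(tp=33, fn=0, tn=31, fp=4)
```

Under these assumed counts the accuracy is 64/68, which is about 94.12% when rounded
(94.11% when truncated, as reported), the sensitivity is 100%, and the specificity is
31/35, about 88.57%.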



It will be apparent to one skilled in the art that various changes and modifications can
be made in the present invention without departing from the spirit and scope of the present
invention. It will be understood that the above example is described in an illustrative
manner and is not to be construed to limit the present invention. Therefore, it is to be
understood that the scope of the present invention will be shown by the following claims
rather than the above detailed description, and all modifications and variations of the
present invention fall within the scope of the appended claims.



In accordance with the present invention, the system and
method for disease analysis facilitates cancer screening by extracting features
corresponding to disease-specific spots by applying an image mining technique to serum
proteomes from normal and diseased individuals, constructing a database consisting of the
features, and comparing the serum proteome of a subject of interest with proteome standards,
thereby allowing early detection of cancer states. In addition, by introducing a fuzzy
rule-based classification method, the system and method for disease analysis can monitor
progression status and future prognosis of cancer diseases, thus making it possible to
perform medical treatment suitable for pathologic states of patients.

CLAIMS

1. A system of analyzing cancer, using a proteome image mining tool, comprising:
input means for inputting serum proteome;
proteome standard production means for generating a proteome standard [...]
two-dimensional images [...] and distinguishing optimal [...] the proteome of a subject
into a two-dimensional image and extracting features from the image;
proteome comparison means for mapping the serum proteome of the subject [...] to
determine similarity between the two patterns;
disease analysis means for estimating the serum proteome of the subject as 'normal'
[...], and otherwise, as 'having cancer', based on the mapping by the proteome comparison
means; and
output means for outputting the analysis results by the disease analysis means.
2. The system as set forth in claim 1, wherein the cancers are selected from the group
consisting of breast cancer, ovarian cancer, stomach cancer, liver cancer, uterine cancer,
lung cancer, large intestine cancer, pancreatic cancer, or prostate cancer.

3. The system as set forth in claim 1, wherein the proteome standard is one or more
selected from spots listed in Table 1.

4. The system as set forth in any of claims 1 to 3, wherein the proteome standard
production means distinguishes the feature data by extracting and normalizing correlations
between spots comprising the proteome standard.



5. The system as set forth in any of claims 1 to 3, wherein the disease analysis means
predicts progression status and future prognosis of cancer when the subject is identified as
having cancer, and, in case that the subject does not have any cancer, predicts probability
of cancer development.

6. The system as set forth in any of claims 1 to 3, wherein the system of analyzing
cancers further comprises coding means for coding personal information of normal
individuals [...] standards, and of personal information of subjects.

7. The system as set forth in any of claims 1 to 3, wherein the proteome standard
production means comprises pre-processing means for obtaining meaningful feature data
from the two-dimensional images of serum proteome, and evolutionary classification means
for identifying normality of a serum proteome of a subject from the feature data obtained by
said pre-processing means.

8. The system as set forth in claim 7, wherein the proteome standard production
means further comprises fuzzy rule-based classification means for extracting correlations
between spots contained in the serum proteome from the feature data obtained by the
pre-processing means, and classifying the extracted correlations by a statistical method.



9. The system as set forth in claim 8, wherein the fuzzy rule-based classification
means comprises data mapping means for computing correlations between spots from the
two-dimensional images of serum proteome, classifying the features by a statistical
technique, and quantifying statistical inaccuracy using a fuzzy technique; and rule-based
classification means for arranging and normalizing the results obtained by the data mapping
means, and thus generating a rule base.



10. The system as set forth in claim 7, wherein the pre-processing means comprises
image processing means for performing general image processing works, including noise
filtering, image enhancement, ortho-projection, edge detection and optimal thresholding,
from the two-dimensional images of serum proteome, and feature extraction means for
extracting features of spots from the image-processed two-dimensional images and labeling
each of the features.



11. The system as set forth in claim 7, wherein the evolutionary classification means
comprises genetic algorithm processing means for discriminating optimal feature data among
the feature data extracted by the pre-processing means, and support vector machine
application means for estimating fidelity of the optimal feature data discriminated by the
genetic algorithm processing means using estimation functions and a classification error
rate.



12. A method of analyzing cancer diseases using a proteome image mining tool,
comprising the steps of:
transforming inputted serum proteomes from normal individuals and individuals
having cancer into two-dimensional images, extracting feature data from the images,
generating a proteome standard by computing optimal features from the feature data, and
constructing a database consisting of the proteome standard (Step 1);
inputting a serum proteome from a subject of interest, transforming the serum
proteome into a two-dimensional image and extracting feature data from the image (Step 2);
and
comparing the structure of the serum proteome pattern of the subject with the
proteome standard and determining whether the serum proteome of the subject is normal or
abnormal, that is, indicative of cancer (Step 3).



13. The method as set forth in claim 12, wherein the cancer diseases are selected
from the group consisting of breast cancer, ovarian cancer, stomach cancer, liver cancer,
uterine cancer, lung cancer, large intestine cancer, pancreatic cancer, or prostate cancer.

14. The method as set forth in claim 12, wherein the proteome standard is one or 
more selected from spots listed in Table 1. 






15. The method as set forth in any of claims 12 to 14, wherein the Step 1 further
includes the steps of extracting correlations between spots contained in the serum proteome
from the two-dimensional images of the serum proteome employing experimental knowledge
and a statistical method, and classifying the extracted correlations by a statistical method.

16. The method as set forth in any of claims 12 to 14, wherein the Step 3 further
includes a step of identifying present disease states and estimating a future prognosis of
the disease by analysing serum proteome of the subject.






17. The method as set forth in any of claims 12 to 14, wherein the Step 3 of
identifying the existence (development) of cancer includes:
a pattern matching step of classifying the serum proteome of the subject into
"normal" or "having a disease" by applying features and estimation functions, extracted upon
[...] spots, contained in the two-dimensional proteome images.

18. The method as set forth in any of claims 12 to 14, further comprising a step of
constructing a database consisting of the serum proteome of [...] thereof, wherein said step
is performed after the Step 3.

19. The method as set forth in any of claims 12 to 14, wherein the Step 1 of
producing a proteome standard comprises:
a pre-processing step [...] proteome images, and a feature extraction step of
extracting basic features in spots from the image-processed two-dimensional images and
producing feature data by labeling each of the extracted features; and
an evolutionary classification step [...] pre-processing step, and extracting optimal
feature data and [...] fidelity of the optimal feature data discriminated by the genetic
algorithm by a support vector machine using estimation functions and classification error
rates.

20. The method as set forth in claim 19, wherein the Step 1 of producing a proteome
standard further comprises:
a data mapping step of computing correlations between spots from the
two-dimensional images of serum proteome, [...] the computed features by a [...] method, and
[...] fuzzy technique; and
a rule-based classification step of arranging and normalizing the results obtained at
the data mapping step, and thus generating a final rule base.

21. A biomarker or biomarkers for diagnosis of cancers, comprising a proteome
pattern, wherein said proteome pattern is one or more selected from spots listed in Table 1.



22. The biomarker or biomarkers as set forth in claim 21, wherein the cancers are
selected from the group consisting of breast cancer, ovarian cancer, stomach cancer, liver
cancer, uterine cancer, lung cancer, large intestine cancer, pancreatic cancer, or prostate
cancer.

FIG. 7